TABLE 4.3
Search efficiency for different search strategies on ImageNet, including previous NAS
in both the real-valued and 1-bit search spaces, random search, and our DCP-NAS.

Method             T.P.   GGN   D.O.   Top-1 Acc. (%)   Search Cost (GPU days)
Real-valued NAS
  PNAS              -      -     -      74.2             225
  DARTS             -      -     -      73.1             4
  PC-DARTS          -      -     -      75.8             3.8
Direct BNAS
  BNAS1             -      -     -      64.3             2.6
  BNAS2-H           -      -     -      63.5             -
Random Search       -      -     -      51.3             4.4
Auxiliary BNAS
  CP-NAS            -      -     -      66.5             2.8
  DCP-NAS-L         ✓      -     -      71.4             27.9
  DCP-NAS-L         ✓      ✓     -      71.2             2.9
  DCP-NAS-L         ✓      -     ✓      72.6             27.9
  DCP-NAS-L         ✓      ✓     ✓      72.4             2.9

Note: T.P. and D.O. denote Tangent Propagation and Decoupled Optimization, respectively.
the tangent direction constraint and the reconstruction error can improve the accuracy on
ImageNet. When applied together, the Top-1 accuracy reaches the highest value of 72.4%.
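As a rough illustration of how these two terms can enter the search objective, the sketch below combines a task loss with a tangent-direction penalty and a reconstruction penalty in PyTorch-style pseudocode. The function name and tensor arguments are hypothetical placeholders, not the actual DCP-NAS implementation; lam and mu play the roles of the trade-off weights λ and μ examined next.

```python
import torch.nn.functional as F

def dcp_nas_style_objective(logits, targets,
                            grad_1bit, grad_real,
                            w_1bit, w_real,
                            lam, mu):
    """Hypothetical sketch of a combined DCP-NAS-style objective.

    task loss : cross-entropy of the 1-bit child model
    tangent   : penalizes disagreement between the 1-bit gradient and the
                gradient of its real-valued counterpart (tangent direction)
    recon     : reconstruction error between 1-bit weights and their
                real-valued counterparts
    lam, mu   : trade-off weights, cf. the sweep in Figure 4.15
    """
    task_loss = F.cross_entropy(logits, targets)
    tangent_loss = F.mse_loss(grad_1bit, grad_real)   # tangent direction constraint
    recon_loss = F.mse_loss(w_1bit, w_real)           # reconstruction error
    return task_loss + lam * tangent_loss + mu * recon_loss
```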
Then we conduct experiments with various values of λ and μ, as shown in Figure 4.15. We
observe that with a fixed value of μ, the Top-1 accuracy first increases with increasing λ
but decreases once λ exceeds 1e-3. As λ grows, DCP-NAS tends to select the binary
architecture whose gradient is most similar to that of its real-valued counterpart; to some
extent, the 1-bit model's own accuracy is neglected, leading to a performance drop. A similar
trend appears for μ: with λ fixed, the Top-1 accuracy first increases and then decreases as μ
grows. Placing too much weight on minimizing the distance between the 1-bit parameters and
their real-valued counterparts can collapse the representation ability of the 1-bit model and
severely degrade the performance of DCP-NAS.
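Before examining the search cost, it may help to recall what the GGN approximation does. For a composite loss l(f(w)), the exact Hessian with respect to w contains second derivatives of the network f, whereas the GGN keeps only the term J^T H_l J, built from the first-order Jacobian J of f and the small Hessian H_l of the loss with respect to the network output. The toy example below is only an illustration under that assumption, not the DCP-NAS implementation; it compares the GGN matrix against the exact Hessian for a tiny nonlinear map.

```python
import torch
from torch.autograd.functional import jacobian, hessian

def f(w):
    # Hypothetical tiny "network": R^3 -> R^2
    return torch.stack([torch.sin(w).sum(), (w ** 2).sum()])

def loss(z):
    # Convex loss on the network output
    return 0.5 * (z ** 2).sum()

w = torch.tensor([0.3, -1.2, 0.7])

J = jacobian(f, w)                           # (2, 3): df/dw, first-order only
H_l = hessian(loss, f(w))                    # (2, 2): d^2 l / dz^2, cheap to form
ggn = J.T @ H_l @ J                          # (3, 3) GGN approximation of the Hessian

exact = hessian(lambda w_: loss(f(w_)), w)   # exact (3, 3) Hessian for comparison
print("GGN:\n", ggn)
print("Exact Hessian:\n", exact)
```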
To better understand the acceleration obtained by applying the Generalized Gauss-Newton
(GGN) matrix in the search process, we conduct experiments to examine the search cost
with and without GGN. As shown in Table 4.3, we compare the search efficiency and the
accuracy of the architectures obtained by random search (random selection), real-valued
NAS methods, binarized NAS methods, CP-NAS, DCP-NAS without GGN, and DCP-NAS
with GGN. In random search, the 1-bit supernet randomly samples and trains an architecture
in each epoch, assigns the expected performance to each corresponding edge and operation,
and returns the architecture with the highest score; this strategy lacks the necessary guidance
during the search and therefore performs poorly for binary architecture search. Notably,
DCP-NAS without GGN is computationally expensive because of the second-order gradient
that must be computed in tangent propagation. Note that directly optimizing two supernets
is also computationally redundant. However, introducing GGN to approximate the Hessian
matrix significantly accelerates the search, reducing the search cost to roughly 10% of the
original with a negligible accuracy fluctuation. As shown in Table 4.3, with GGN our method
reduces the search cost from 27.9 to 2.9 GPU days, which is more efficient than DARTS.
Additionally, our DCP-NAS achieves a